2  Web Scraping

Published

August 16, 2023

Abstract
This section contains guidelines and processes for gathering information from the web using web scraping. We will focus on two approaches. First, we will learn to download many files. Second, we will learn to gather information from the bodies of websites.

2.1 Review

APIs

Some read_excel content in 02

2.2 Introduction and Motivation

The Internet is an immense source of information for research. Sometimes we can easily download data of interest in an ideal format with the click of a download button or a single API call.

But it probably won’t be long until we need data that require many download button clicks. Or worse, we may want data from web pages that don’t have a download button.

Consider a few examples.

We will explore two approaches for gathering information from the web.

  1. Iteratively downloading files: Sometimes websites contain useful information across many files that need to be separately downloaded. We will use code to download these files. Ultimately, these files can be combined into one larger data set for research.
  2. Scraping content from the body of websites: Sometimes useful information is stored as tables or lists in the body of websites. We will use code to scrape this information and then parse and clean the result. This is similar to using web APIs, except the information is not designed to be read by code.

Sometimes we download many PDF files using the first approach. A related method that we will not cover, but that is also useful for gathering information from the web, is extracting text from PDFs.
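As a minimal sketch of that related method, assuming the pdftools package is installed and that "data/example.pdf" is a hypothetical downloaded file:

```r
library(pdftools)

# pdf_text() returns a character vector with one element per page
pages <- pdf_text("data/example.pdf")

# inspect the text of the first page
cat(pages[1])
```

The result is raw text, so the usual next step is parsing it with string tools like library(stringr).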

2.4 Programmatically Downloading Data

The County Health Rankings & Roadmaps is a source of state and local information.

Suppose we are interested in Injury Deaths at the state level. We can click through the interface and download a .xlsx file for each state.

  1. Start here.
  2. Using the interface at the bottom of the page, we can navigate to the page for “Virginia.”
  3. Next, we can click “View State Data.”
  4. Next, we can click “Download Virginia data sets.”

That’s a lot of clicks to get here.

If we want to download “2023 Virginia Data”, we can typically right-click on the link and select “Copy Link Address”. This should return one of the following two URLs:

  • https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx
  • https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx
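Either form points to the same file. As a sketch, base R's URLencode() converts the version with spaces into the percent-encoded version:

```r
# URLencode() percent-encodes characters such as spaces; with the default
# reserved = FALSE it leaves "/" and ":" untouched
url_spaces <- "https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Virginia Data - v2.xlsx"

url_encoded <- URLencode(url_spaces)
```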

If we plug that URL into a web browser, it will automatically download the file. Alternatively, we can use download.file() to download the file, provided we supply a destfile.

download.file(
  url = "https://www.countyhealthrankings.org/sites/default/files/media/document/2023%20County%20Health%20Rankings%20Virginia%20Data%20-%20v2.xlsx", 
  destfile = "data/virginia-injury-deaths.xlsx"
)
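Once downloaded, the workbook can be read into R. This is a sketch assuming library(readxl) is installed; the sheet index below is an assumption, so inspect the workbook with excel_sheets() first:

```r
library(readxl)

# list the sheet names before picking one
excel_sheets("data/virginia-injury-deaths.xlsx")

# read the first sheet; replace with the sheet that holds injury deaths
virginia <- read_excel("data/virginia-injury-deaths.xlsx", sheet = 1)
```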

If we poke around, we can see that all of the state data follow a common pattern. For example, the URL for Vermont is https://www.countyhealthrankings.org/sites/default/files/media/document/2023 County Health Rankings Vermont Data - v2.xlsx

The URLs differ only by the state name, "Virginia" or "Vermont". Now we can iterate over the downloads. We will only download data for two states, but we can imagine downloading data for many states or many counties.

A couple of tips:

  • paste0() and str_glue() from library(stringr) are useful for creating URLs and destination files.
  • walk() from library(purrr) can iterate a function over a vector. It’s like map(), but we use it when we are interested in the side effect of a function rather than its return value.
  • Sometimes data are messy and we want to be polite. Custom functions can help with rate limiting and cleaning data.

Bonus: read_csv() can directly read .csvs from the Internet. However, I still like to download the data because the Internet is a moving target.

library(purrr)

states <- c("Virginia", "Vermont")

urls <- paste0(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023 County Health Rankings ",
  states,
  " Data - v2.xlsx"
)

# the downloads are .xlsx workbooks, so keep the extension
output_files <- paste0("data/", states, ".xlsx")

download_chr <- function(url, destfile) {

  # mode = "wb" avoids corrupting binary files on Windows
  download.file(url, destfile, mode = "wb")

  # pause briefly between downloads to be polite
  Sys.sleep(0.5)

}

# walk2() iterates over two vectors in parallel; walk() takes only .x
walk2(.x = urls, .y = output_files, .f = download_chr)

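As a sketch of the first tip above, str_glue() from library(stringr) can build the same URLs as the paste0() call, with the state names interpolated inside {}:

```r
library(stringr)

states <- c("Virginia", "Vermont")

# {states} is replaced element-wise, returning one URL per state
urls <- str_glue(
  "https://www.countyhealthrankings.org/sites/default/files/",
  "media/document/2023 County Health Rankings {states} Data - v2.xlsx"
)
```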
Exercise 1

SOI Tax Stats - Historic Table 2 provides individual income and tax data, by state and size of adjusted gross income. The website contains a bulleted list of URLs and each URL downloads a .xlsx file.

  1. Use download.file() to download the file for Alabama.
  2. Explore the URLs using “Copy Link Address”.
  3. Iterate pulling the data for Alabama, Alaska, and Arizona.

  1. We are not lawyers. This is not official legal advice. If in doubt, please contact a legal professional.↩︎

  2. This blog and this blog support this statement. Again, we are not lawyers and the HiQ Labs v. LinkedIn decision is complicated because of its long history and conclusion in settlement.↩︎

  3. The scale of crawling is so great that there is concern about models converging once all models use the same massive training data. Common Crawl is one example. This isn’t a major issue for generating images, but model homogeneity is a big concern in finance.↩︎

  4. Every year, newspapers across the country FOIA information about government employees and publish their full names, job titles, and salaries.↩︎